18 research outputs found

    Concepts and Effectiveness of the Cover Coefficient Based Clustering Methodology for Text Databases

    An algorithm for document clustering is introduced. The base concept of the algorithm, the Cover Coefficient (CC) concept, provides a means of estimating the number of clusters within a document database. The CC concept is also used to identify the cluster seeds, to form clusters with the seeds, and to calculate Term Discrimination and Document Significance values (TDV, DSV). TDVs and DSVs are used to optimize document descriptions. The CC concept also relates indexing and clustering analytically. Experimental results indicate that clustering performance, in terms of the percentage of useful information accessed (precision), is forty percent higher than that of random assignment of documents to clusters, with an accompanying reduction in search space. The experiments have validated the indexing-clustering relationships and shown improvements in retrieval precision when the TDV and DSV optimizations are used.
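    The CC-based estimate of the number of clusters can be sketched briefly. Below is a minimal Python illustration, assuming a binary document-by-term matrix in which every document contains at least one term; the function names are ours, not the paper's.

```python
def cover_coefficients(D):
    """Cover-coefficient matrix for a binary document-by-term matrix D.

    C[i][j] estimates the extent to which document i is "covered" by
    document j; C[i][i] is the decoupling coefficient of document i.
    Assumes every document (row) has at least one term.
    """
    m, n = len(D), len(D[0])
    row = [sum(D[i]) for i in range(m)]                       # document lengths
    col = [sum(D[i][k] for i in range(m)) for k in range(n)]  # term frequencies
    C = [[0.0] * m for _ in range(m)]
    for i in range(m):
        for j in range(m):
            s = 0.0
            for k in range(n):
                if col[k]:
                    s += D[i][k] * D[j][k] / col[k]
            C[i][j] = s / row[i]
    return C

def estimated_cluster_count(D):
    """Estimate the number of clusters as the sum of the diagonal
    (decoupling) coefficients of the cover-coefficient matrix."""
    C = cover_coefficients(D)
    return sum(C[i][i] for i in range(len(C)))
```

    By construction each row of C sums to 1, so the diagonal sum lies between 1 and the number of documents; for a toy matrix with two disjoint groups of documents the estimate lands on the group count.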

    Experiments on Tunable Indexing

    The effectiveness and efficiency of an Information Retrieval (IR) system depend on the quality of its indexing system. Indexing can be used in inverted file systems or in cluster-based retrieval. In this article, a new concept called tunable indexing is introduced. With tunable indexing, the number of clusters of a document clustering system can be varied to any desired value. Also covered are the computation of the Term Discrimination Value (TDV) with the cover coefficient (CC) concepts and its use in tunable indexing. A set of experiments has shown the consistency between the CC-based TDVs and the TDVs determined with the known methods. The main use of tunable indexing has been observed in determining the parameters of a clustering system.

    and

    No full text
    A new algorithm for document clustering is introduced. The base concept of the algorithm, the cover coefficient (CC) concept, provides a means of estimating the number of clusters within a document database and relates indexing and clustering analytically. The CC concept is also used to identify the cluster seeds and to form clusters with these seeds. It is shown that the complexity of the clustering process is very low. The retrieval experiments show that the information-retrieval effectiveness of the algorithm is compatible with a very demanding complete linkage clustering method that is known to have good retrieval performance. The experiments also show that the algorithm is 15.1 to 63.5 (with an average of 47.5) percent better than four other clustering algorithms in cluster-based information retrieval. The experiments have validated the indexing-clustering relationships and the complexity of the algorithm and have shown improvements in retrieval effectiveness. In the experiments, two document databases are used: TODS214 and INSPEC. The latter is a common database with 12,684 documents.

    A Spatial Grid File For Multimedia Data Representation

    No full text
    In multimedia databases, spatial or high-dimensional data manipulation is important for storage and retrieval. In this study, we introduce a new file structure called the Spatial Grid File. This file enables us to index data objects by different and independent high-dimensional attributes. With it, well-known spatial query types, such as range queries, nearest neighbor queries, and spatial join operations, can be performed efficiently. Although the performance of the Spatial Grid File structure depends on the indexing methods used, it has the unique feature of combining sets of spatial data, each having different properties. Furthermore, this file structure is very suitable for parallelization.
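    As an illustration of the kind of grid-based access such a structure supports, here is a toy uniform grid index in Python. It is a sketch of grid-cell range querying under our own simplifying assumptions (uniform cells, points in memory), not the Spatial Grid File's actual layout; the paper's structure keeps one such index per independent high-dimensional attribute.

```python
import itertools
from collections import defaultdict

class GridIndex:
    """Toy uniform grid over one d-dimensional attribute (illustrative
    only, not the Spatial Grid File's actual organization)."""

    def __init__(self, cell_size):
        self.cell_size = cell_size
        self.cells = defaultdict(list)   # cell coordinates -> bucket

    def _cell(self, point):
        return tuple(int(x // self.cell_size) for x in point)

    def insert(self, point, obj):
        self.cells[self._cell(point)].append((point, obj))

    def range_query(self, lo, hi):
        """Return objects whose points fall inside the box [lo, hi]."""
        # Visit only the cells overlapping the query box, then filter exactly.
        ranges = [range(int(l // self.cell_size), int(h // self.cell_size) + 1)
                  for l, h in zip(lo, hi)]
        out = []
        for cell in itertools.product(*ranges):
            for point, obj in self.cells.get(cell, ()):
                if all(l <= x <= h for x, l, h in zip(point, lo, hi)):
                    out.append(obj)
        return out
```

    The design point this illustrates is that a range query touches only the buckets overlapping the query box rather than scanning the whole file.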

    Scheduling parallel programs involving parallel database interactions

    No full text
    © Springer-Verlag Berlin Heidelberg 1997. In this study, we develop a new static scheduling scheme which integrates parallel programming environments with parallel database systems to optimize program execution. In parallel programming, a sequential program is first converted to a task graph, either with programmer guidance or by a restructuring compiler. Next, a scheduling algorithm assigns the nodes of the task graph to processors. However, a question arises when some tasks have to access a parallel database system. Our scheme extends the static list scheduling approach to efficiently execute the database accesses of parallel programs. To handle database interaction, the input task graph is regenerated to indicate the task(s) with database interaction, and these tasks are modified according to parallel database system characteristics (such as query type, database partitioning knowledge, etc.). Then the proposed algorithm runs on the expanded graph using a new heuristic referred to as DTF (Database Task First). To prove the usefulness of our scheme, we take the TPM (Task-to-Processor Mapping) algorithm as the base. As a first step, the TPM algorithm is modified to work on the expanded task graph, and the highest priorities are always given to database tasks. Then some task graphs with database interaction are scheduled with the original TPM using the earliest-task-first heuristic and compared with the modified version. Early simulation results show that considering database interaction during scheduling and the DTF heuristic drastically improve scheduling performance. It is also important to note that our scheme is not dependent on any specific scheduling algorithm; any algorithm can be modified to handle database interactions in the way proposed in this study.
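    The DTF idea of giving ready database tasks priority in a list scheduler can be sketched as follows. This is an illustrative simplification: the task representation, the secondary longest-task tie-break, and the earliest-available-processor rule are our assumptions, not the paper's exact algorithm.

```python
def dtf_schedule(tasks, deps, num_procs):
    """List scheduling with a Database-Task-First (DTF) tie-break.

    tasks: {name: (cost, is_db_task)}; deps: iterable of (pred, succ) edges.
    Returns {name: (processor, start, finish)}.
    """
    preds = {t: [] for t in tasks}
    succs = {t: [] for t in tasks}
    for a, b in deps:
        succs[a].append(b)
        preds[b].append(a)
    indeg = {t: len(preds[t]) for t in tasks}
    ready = [t for t in tasks if indeg[t] == 0]
    proc_free = [0.0] * num_procs   # time each processor becomes idle
    finish_time = {}
    schedule = {}
    while ready:
        # DTF heuristic: among ready tasks, database tasks come first;
        # longer tasks break remaining ties.
        ready.sort(key=lambda t: (not tasks[t][1], -tasks[t][0]))
        t = ready.pop(0)
        est = max((finish_time[p] for p in preds[t]), default=0.0)
        proc = min(range(num_procs), key=lambda i: max(proc_free[i], est))
        start = max(proc_free[proc], est)
        finish = start + tasks[t][0]
        proc_free[proc], finish_time[t] = finish, finish
        schedule[t] = (proc, start, finish)
        for s in succs[t]:          # release successors whose preds are done
            indeg[s] -= 1
            if indeg[s] == 0:
                ready.append(s)
    return schedule
```

    The only DTF-specific line is the sort key: any list scheduler can adopt it, which mirrors the abstract's point that the scheme is not tied to one scheduling algorithm.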

    ABSTRACT

    No full text
    Partitioning very large databases by clustering is a necessity to reduce the space/time complexity of retrieval operations. However, contemporary and modern retrieval environments demand dynamic maintenance of clusters. A new cluster maintenance strategy is proposed, and its similarity/stability characteristics, cost analysis, and retrieval behavior in comparison with unclustered and completely reclustered database environments have been examined by means of a series of experiments.
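    The incremental alternative to complete reclustering can be illustrated with a minimal sketch in which each new document joins the most similar existing cluster or starts a new one. The `similarity` function and `threshold` parameter here are hypothetical placeholders; the proposed maintenance strategy itself is more involved than this.

```python
def maintain_clusters(clusters, new_docs, similarity, threshold=0.0):
    """Incremental cluster maintenance sketch (illustrative only).

    clusters: list of lists of documents, mutated in place.
    similarity: hypothetical pairwise document similarity function.
    Each new document joins the cluster containing its most similar
    member, or founds a new cluster when nothing exceeds `threshold`.
    """
    for d in new_docs:
        best, best_sim = None, threshold
        for c in clusters:
            s = max(similarity(d, member) for member in c)
            if s > best_sim:
                best, best_sim = c, s
        if best is None:
            clusters.append([d])    # no cluster is similar enough
        else:
            best.append(d)
    return clusters
```

    The trade-off the abstract examines is exactly this one: incremental assignment is far cheaper than reclustering the whole database, at the price of clusters that can drift from what a full recluster would produce.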

    BLOCKER: A variable & multiattribute declustering for parallel database machines

    No full text
    © 1996 Springer Verlag. All rights reserved. In this paper, a new declustering approach is introduced which uses an analytical workload model to determine the number of processors for each relation to execute the given workload efficiently. The number of processors determined is used in generating load-balanced declustering based on a well-known multiattribute file structure [4]. The result turns into a parallel file structure called PARMA [5]. The overall system, therefore, combines all the advantages of variable and multiattribute declustering together with efficient processor mapping [2] and the underlying parallel file support.
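    The multiattribute side of such declustering can be illustrated with a minimal hash-based sketch. The combining scheme below is our own assumption for illustration, and it omits BLOCKER's analytical, per-relation choice of the processor count.

```python
def multiattribute_decluster(tuples, attrs, num_procs):
    """Multiattribute hash declustering sketch (illustrative only).

    The target processor for each tuple is derived from several
    attributes combined, so that data touched by selections on any of
    those attributes is spread across the processors.
    tuples: list of dicts; attrs: attribute names used for placement.
    Returns a list of num_procs partitions.
    """
    partitions = [[] for _ in range(num_procs)]
    for t in tuples:
        # Combine per-attribute hashes into one placement key.
        key = 0
        for a in attrs:
            key = (key * 31 + hash(t[a])) % num_procs
        partitions[key].append(t)
    return partitions
```

    In BLOCKER the declustering degree itself is variable: the workload model first picks how many processors each relation should span, and only then is a load-balanced placement of this kind generated.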